Organization and Reproducibility: renv and CookieCutter Data Science

Organizing a Repo

To this point, we haven’t really talked about how we should organize a repository. A repository is just a set of files to track over time, but how do we organize those files?

I am arguably not the best guide for this, as I am generally a disorganized person and it shows in my older repos.

On some level, this is to be expected with data analysis/data science - we rarely work in a linear progression with the same set of files from project to project.

But I’ve started to use a specific style for organization that seems to suit data science projects well.

CookieCutter Data Science

It’s no secret that good analyses are often the result of very scattershot and serendipitous explorations. Tentative experiments and rapidly testing approaches that might not work out are all part of the process for getting to the good stuff, and there is no magic bullet to turn data exploration into a simple, linear progression.

A well-defined, standard project structure means that a newcomer can begin to understand an analysis without digging into extensive documentation. It also means that they don’t necessarily have to read 100% of the code before knowing where to look for very specific things.

It’s less important to have the perfect organization for a given project than it is to have some sort of standard that everyone understands and uses.

The goal is to organize projects in a way that will make it easier for others and your future self to remember.

Sadly, this template is intended for Python, but we can adapt it for R easily enough. Let’s zoom in a bit on specific pieces.

├── Makefile        <- Makefile with commands like `make data` or `make train`
├── README.md       <- The top-level README for developers
├── data
│   ├── external    <- Data from third party sources.
│   ├── interim     <- Intermediate data that has been transformed.
│   ├── processed   <- The final, canonical data sets for modeling.
│   └── raw         <- The original, immutable data dump.
  • Always include a README as an organizing guide
  • Data generally isn’t stored in repos, but if it is you can follow this organization that tracks the lineage of the data
  • One of the guiding principles for CookieCutter Data Science is to treat data as immutable. The point of a project is to interact and work with data, but we never change it from its raw sources.
  • We will shortly discuss the R equivalent of a Makefile, with the aim that our project is organized to do one specific thing
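
The immutability principle above can be sketched in R: raw data is only ever read, and every transformation writes its result downstream. The file names and cleaning step here are hypothetical.

```r
# Sketch: raw data is read-only; transformations write to interim/processed.
# Assumes a hypothetical file data/raw/survey.csv with an `age` column.
library(readr)
library(dplyr)

raw <- read_csv("data/raw/survey.csv")   # never overwritten

interim <- raw |>
  filter(!is.na(age))                    # example cleaning step

write_csv(interim, "data/interim/survey_clean.csv")
```

If you later need a different cleaning rule, you rerun the code against `data/raw` rather than editing any stored file by hand.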

├── models    <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks   <- Jupyter notebooks. Naming convention is a number (for ordering),
│                 the creator's initials, and a short `-` delimited description, e.g.
│                 `1.0-jqp-initial-data-exploration`.
│
├── references    <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports         <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures     <- Generated graphics and figures to be used in reporting
│
  • Similarly, models typically aren’t stored in a repo, but we might want to save summaries or model cards
  • Notebooks are places for exploratory analysis and should be treated mostly as a sandbox
  • Store all background documentation, project discussion, and articles that have been used and discussed in references

├── requirements.txt    <- The requirements file for reproducing the analysis environment.
│
├── src                <- Source code for use in this project.
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
  • The repo must detail the requirements for someone to reproduce the project.
  • In Python this is requirements.txt; we will discuss the R equivalent, renv, next.
  • All code used in the project is stored and organized in src

Data is Immutable

Don’t ever edit your raw data, especially not manually, and especially not in Excel. Don’t overwrite your raw data. Don’t save multiple versions of the raw data. Treat the data (and its format) as immutable.

The code you write should move the raw data through a pipeline to your final analysis. You shouldn’t have to run all of the steps every time you want to make a new figure (see Analysis is a DAG), but anyone should be able to reproduce the final products with only the code in src and the data in data/raw.

Also, if data is immutable, it doesn’t need source control in the same way that code does. Therefore, by default, the data folder is included in the .gitignore file.
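
A minimal .gitignore for this layout might look like the excerpt below. Keeping a README inside data/ (via the `!` exception) is one common convention, not something CookieCutter Data Science requires.

```
# .gitignore (excerpt)
data/
!data/README.md
```

Anyone cloning the repo then regenerates data/ from the code in src rather than pulling it from version control.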

Analysis is a DAG

Often in an analysis you have long-running steps that preprocess data or train models. If these steps have been run already (and you have stored the output somewhere like the data/interim directory), you don’t want to wait to rerun them every time. We prefer make for managing steps that depend on each other, especially the long-running ones.

  • This will be the point of emphasis in using targets, bringing Make-like functionality to R.
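
To make the DAG idea concrete, here is a minimal sketch of what a `_targets.R` pipeline can look like. The file name `data/raw/guns.csv` and the functions `fit_model()` and `plot_results()` are hypothetical stand-ins for functions you would define in src.

```r
# _targets.R -- a minimal sketch of a targets pipeline.
library(targets)
tar_source("src")  # load the project's functions from src/

list(
  tar_target(raw_file, "data/raw/guns.csv", format = "file"),
  tar_target(raw, read.csv(raw_file)),    # reruns only if the file changes
  tar_target(model, fit_model(raw)),      # fit_model() defined in src/
  tar_target(report, plot_results(model)) # downstream steps rerun only when
)                                         # their upstream dependencies change
```

Running `targets::tar_make()` executes only the steps whose inputs have changed, which is exactly the Make-like behavior described above.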

Functional Programming

It’s hard to describe exactly what a functional style is, but generally I think it means decomposing a big problem into smaller pieces, then solving each piece with a function or combination of functions. When using a functional style, you strive to decompose components of the problem into isolated functions that operate independently. Each function taken by itself is simple and straightforward to understand; complexity is handled by composing functions in various ways.

  • CookieCutter Data Science doesn’t go into much detail on what your src code should look like, but I have found it naturally suits a functional programming style.

  • Rather than writing scripts that execute tasks, it’s generally better to write a series of functions that are then called and used in a pipeline.
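
As a sketch of that style, the script below decomposes a task into three small functions and composes them. The column names `value` and `group` are hypothetical.

```r
# Decompose the problem into small, independently testable functions.
load_data      <- function(path) read.csv(path)
clean_data     <- function(df) df[!is.na(df$value), ]
summarise_data <- function(df) aggregate(value ~ group, data = df, FUN = mean)

# Compose them into a pipeline; each piece can be tested in isolation.
result <- "data/raw/example.csv" |>
  load_data() |>
  clean_data() |>
  summarise_data()
```

Each function here could live in src and be called from a targets pipeline, rather than being buried in one long script.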

A Repo Should Be One Project

Another thing that CookieCutter Data Science helps address: what should even be a repo? When we’re working on a project, how do we define and organize our code?

Do we create one repository for all of our data science projects? Do we create one repository per project?

This more or less becomes an argument between monorepos vs multi-repos.

Monorepo vs Multi-repo

A monorepo is one repository that contains code for a lot of different projects and tasks.

Imagine you have one big project you’re working on, containing a lot of separate pieces and code. The monorepo approach says, throw it all into the same repo.

As opposed to a multi-repo, where aspects of a larger project are isolated and separated into individual repositories.

A Repo Should Be One Project

The CookieCutter Data Science approach is much more conducive towards the multi-repo approach:

  • A repository exists for a specific task.
  • The code in the repository executes that task.
  • The requirements for running that code are defined in the repository.

It becomes a lot harder to define requirements and reproduce the environment to run code when you have a gigantic, monolithic repository.

But what if we want to re-use code across multiple repositories?

More on this later, but basically this is where submodules might come into play…

Or, just create another repo in the form of a package that can be used across multiple projects.

CookieCutter Data Science (for R)

Given these principles, most of my repos end up being organized in the following way:

├── _targets      <- stores the metadata and objects of your pipeline
├── renv          <- information relating to your R packages and dependencies
├── data          <- data sources used as an input into the pipeline
├── src           <- functions used in project/targets pipeline
│   ├── data      <- functions relating to loading and cleaning data
│   ├── models    <- functions involved with training models
│   └── reports   <- functions used in generating tables and visualizations for reports
├── _targets.R    <- script that runs the targets pipeline
└── renv.lock     <- lockfile detailing project requirements and dependencies

Again, I’m not saying that this is THE OBJECTIVELY CORRECT WAY TO ORGANIZE AN R PROJECT. But it’s been a useful starting point for me in my work.

One of the key pillars to this organization is renv.

renv

Let’s go back to the issues we had in running certain files in the starwars or board_games repo.

How often do you want to run someone else’s code, only to find that you need to install additional packages?

How often do you try to run someone else’s code only to discover that they’re using a deprecated function?

How many times have you gotten a headache because dplyr can’t make up its mind between mutate_if, mutate_at, mutate_all, and mutate(across())?

The renv package aims to solve most of these problems by helping you to create reproducible environments for R projects.

renv allows you to scan and find packages used in your project. This produces a list of packages with their current versions and dependencies. Using renv with a project adds three pieces to your repo:

  • renv/library: a library that contains all packages currently used by your project.

This is the key magic that makes renv work: instead of having one library containing the packages used in every project, renv gives you a separate library for each project.

  • renv.lock: a lockfile that records metadata about every package used in the project; this allows the project’s packages to be reinstalled on a new machine
  • .Rprofile: adds a file that runs every time you open up the project; this file runs renv::activate() and configures your project to use the renv/library
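
The .Rprofile that renv writes is a single line, which is what wires the project to its private library on startup:

```r
# .Rprofile (written by renv::init())
source("renv/activate.R")
```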

We then add (pieces of) renv/library, renv.lock, and .Rprofile to our repository and commit them.

If we make a change to our code, we use renv to track whether that code has introduced, removed, or changed our dependencies. When we commit the change to our code, we will also commit a change to our renv.lock file.

In this way, using Git + renv allows us to store a history of how our project dependencies have changed with every commit.

So, how do we do this?

We will need to get to know a few functions from renv.

renv key functions

  • renv::init() initializes renv in a project. This will scan for dependencies, install them in a project library, and create a lockfile describing the current state of the project.
  • renv::dependencies() scans for dependencies and finds which scripts make use of packages
  • renv::snapshot() creates or updates a lockfile with the current state of packages used in the project
  • renv::status() compares the current dependencies of your project vs the dependencies detailed in the lockfile.
  • renv::restore() restores a project’s dependencies from a lockfile. This is typically the first command when working with a repo that has an existing lockfile.
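
A typical session using these functions might look like the following; `ggridges` here is just an example package.

```r
renv::init()                  # first-time setup: scan, install, write renv.lock

install.packages("ggridges")  # add a new dependency to the project library
renv::status()                # reports that the lockfile is out of sync
renv::snapshot()              # record the new dependency in renv.lock

# On a fresh clone of a repo that already has a lockfile:
renv::restore()               # reinstall the exact recorded package versions
```

Committing renv.lock after each `renv::snapshot()` is what gives you the per-commit dependency history described above.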

renv - Demo

  • Applying renv to an existing project: guns-data
  • Initialize renv with renv::init()
  • Finding dependencies via renv::dependencies()
  • Install a new package; check renv::status()
  • Use new package in a script: check renv::status()
  • Updating lockfile with renv::snapshot()

Your Turn

  • Load the starwars repo
  • Create a new branch
  • Initialize renv with renv::init()
  • Find dependencies via renv::dependencies()
  • Add a new script that makes use of a new package (visualizations: ggforce, ggridges)
  • Check the status via renv::status()
  • Update the lockfile via renv::snapshot()
  • Commit renv, renv.lock, and .Rprofile

  • renv::restore() restores a project’s dependencies from a lockfile. This is typically the first command when working with a repo that has an existing lockfile.

Your Turn

  • Fork and clone https://github.com/ds-workshop/phils_collection
  • Check your remotes to see if you are configured to track changes from upstream/main
  • Create a new branch with the syntax username-feature
  • Restore the project’s dependencies using the lockfile